LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning

Authors

Abstract

Gradient-based distributed learning in parameter server (PS) computing architectures is subject to random delays due to straggling worker nodes and to possible communication bottlenecks between the PS and the workers. Solutions have recently been proposed to separately address these impairments based on the ideas of gradient coding (GC), grouping, and adaptive selection. This article provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of GC and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named lazily aggregated gradient coding (LAGC) and grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.
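To make the combination concrete, the following is a minimal, illustrative sketch (not the LAGC/G-LAG algorithms of the article) of a parameter-server loop that pairs replication-based straggler tolerance with a LAG-style rule under which a group uploads a fresh gradient only if it differs sufficiently from the last one it sent. The least-squares setup, the threshold parameter, and all names are assumptions introduced for illustration.

# Illustrative sketch only: a parameter-server loop mixing (i) shard replication for
# straggler tolerance with (ii) a LAG-style "skip the upload if the gradient barely
# changed" rule. This is NOT the LAGC / G-LAG scheme of the article.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: loss(x) = 0.5 * ||A x - b||^2
n_samples, dim = 200, 10
A = rng.standard_normal((n_samples, dim))
b = rng.standard_normal(n_samples)

n_groups = 4                                      # each data shard is assigned to one group
shards = np.array_split(np.arange(n_samples), n_groups)

def shard_gradient(x, idx):
    """Partial gradient computed from one data shard."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi)

x = np.zeros(dim)
stale = [np.zeros(dim) for _ in range(n_groups)]  # last gradient uploaded by each group
lr, threshold = 1e-3, 1e-1                        # `threshold` is an assumed tuning knob

for t in range(200):
    agg = np.zeros(dim)
    for g, idx in enumerate(shards):
        # In a grouped/coded scheme the shard would be replicated inside the group,
        # so the server only waits for the fastest replica; here it is computed once.
        fresh = shard_gradient(x, idx)
        # LAG-style condition: upload only if the gradient changed enough; otherwise
        # the server reuses the stale copy and one round of communication is saved.
        if np.linalg.norm(fresh - stale[g]) ** 2 > threshold:
            stale[g] = fresh
        agg += stale[g]
    x -= lr * agg

print("final loss:", 0.5 * np.linalg.norm(A @ x - b) ** 2)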

Related Articles

Communication-Computation Efficient Gradient Coding

This paper develops coding techniques to reduce the running time of distributed learning tasks. It characterizes the fundamental tradeoff for computing gradients (and, more generally, vector summations) in terms of three parameters: computation load, straggler tolerance, and communication cost. It further gives an explicit coding scheme that achieves the optimal tradeoff based on recursive polynomial...

Gradient Sparsification for Communication-Efficient Distributed Optimization

Modern large scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information such as stochastic gradients among different workers. In this paper, to reduce the communication cost we propose a convex optimization formulation to minimize the coding...
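The sketch below illustrates, under assumptions, the generic idea of probabilistic gradient sparsification with unbiased rescaling: coordinates are dropped at random with keep-probabilities proportional to their magnitudes, and the survivors are scaled up so the sparsified vector remains unbiased. The probability rule and the budget parameter are assumptions; the paper instead obtains the probabilities by solving a convex optimization problem, which is not reproduced here.

# Toy probabilistic gradient sparsifier with unbiased rescaling (illustration only).
import numpy as np

rng = np.random.default_rng(1)

def sparsify(grad, budget):
    """Keep roughly `budget` coordinates in expectation, without introducing bias."""
    mags = np.abs(grad)
    if mags.sum() == 0:
        return grad.copy()
    # Assumed rule: keep-probability proportional to |g_i|, capped at 1.
    p = np.minimum(1.0, budget * mags / mags.sum())
    keep = rng.random(grad.shape) < p
    out = np.zeros_like(grad)
    out[keep] = grad[keep] / p[keep]   # rescale survivors so that E[out] == grad
    return out

g = rng.standard_normal(1000)
sg = sparsify(g, budget=100)
print("nonzeros sent:", np.count_nonzero(sg), "of", g.size)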

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a f...

Gradient Coding: Avoiding Stragglers in Distributed Learning

We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and ge...
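A small worked instance of this kind of scheme (the standard three-worker, one-straggler example used in the gradient-coding literature) may help: each of three data partitions is replicated on two workers, each worker transmits a single coded combination of its partial gradients, and the sum g1 + g2 + g3 can be recovered from any two workers. The specific coefficients below are illustrative.

# Toy 3-worker gradient code tolerating 1 straggler (illustration only).
import numpy as np

rng = np.random.default_rng(2)
dim = 5
g1, g2, g3 = (rng.standard_normal(dim) for _ in range(3))
full = g1 + g2 + g3

# Encoding matrix B: row i gives the combination sent by worker i.
B = np.array([[0.5, 1.0, 0.0],    # worker 1: g1/2 + g2   (holds partitions 1, 2)
              [0.0, 1.0, -1.0],   # worker 2: g2 - g3     (holds partitions 2, 3)
              [0.5, 0.0, 1.0]])   # worker 3: g1/2 + g3   (holds partitions 1, 3)
sent = B @ np.stack([g1, g2, g3])           # what each worker transmits

# Decoding vectors a_S with a_S @ B[S] == [1, 1, 1] for every 2-worker subset S.
decoders = {(0, 1): np.array([2.0, -1.0]),
            (0, 2): np.array([1.0, 1.0]),
            (1, 2): np.array([1.0, 2.0])}

for survivors, a in decoders.items():
    recovered = a @ sent[list(survivors)]
    assert np.allclose(recovered, full)
print("full gradient recovered from every 2-worker subset")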

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at...
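A rough sketch of the "encode, then drop stragglers" pattern described above, under assumed choices: a least-squares problem is left-multiplied by a tall random encoding matrix to build in redundancy, the encoded rows are split across workers, and at each iteration one randomly chosen worker is treated as straggling and simply left out. This only illustrates the stated idea; it is not the encoding construction of the paper, and the redundancy factor, step size, and names are assumptions.

# Toy "encoded" distributed least squares: minimize ||S(Ax - b)||^2, where S is a tall
# random encoding matrix adding redundancy, while straggling workers are dropped.
import numpy as np

rng = np.random.default_rng(3)
n, d, n_workers = 120, 8, 6
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true

# Over-complete encoding with an assumed redundancy factor of 2: S has 2n rows.
S = rng.standard_normal((2 * n, n)) / np.sqrt(2 * n)
A_enc, b_enc = S @ A, S @ b
row_blocks = np.array_split(np.arange(2 * n), n_workers)  # one encoded block per worker

x, lr = np.zeros(d), 1e-3
for t in range(400):
    # Pretend one randomly chosen worker straggles this round and is left out.
    straggler = rng.integers(n_workers)
    grad = np.zeros(d)
    for w, rows in enumerate(row_blocks):
        if w == straggler:
            continue
        Aw, bw = A_enc[rows], b_enc[rows]
        grad += Aw.T @ (Aw @ x - bw)
    x -= lr * grad

print("distance to x_true:", np.linalg.norm(x - x_true))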

Journal

Journal title: IEEE Transactions on Neural Networks and Learning Systems

Year: 2021

ISSN: 2162-237X, 2162-2388

DOI: https://doi.org/10.1109/tnnls.2020.2979762